Abstract

We consider in this work the probabilistic approach to model-based diagnosis when applied to electrical
power systems (EPSs). Our probabilistic approach is formally well-founded, as it is based on Bayesian
networks and arithmetic circuits. We investigate the diagnostic task known as fault isolation, and pay 
special attention to meeting two of the main challenges — model development and real-time reasoning 
— often associated with real-world application of model-based diagnosis technologies. To address the 
challenge of model development, we develop a systematic approach to representing electrical power sys- 
tems as Bayesian networks, supported by an easy-to-use specification language. To address the real-time 
reasoning challenge, we compile Bayesian networks into arithmetic circuits. Arithmetic circuit evaluation 
supports real-time diagnosis by being predictable and fast. In essence, we introduce a high-level EPS 
specification language from which Bayesian networks that can diagnose multiple simultaneous failures 
are auto-generated, and we illustrate the feasibility of using arithmetic circuits, compiled from Bayesian 
networks, for real-time diagnosis on real-world EPSs of interest to NASA. The experimental system is 
a real-world EPS, namely the Advanced Diagnostic and Prognostic Testbed (ADAPT) located at the 
NASA Ames Research Center. In experiments with the ADAPT Bayesian network, which currently con- 
tains 503 discrete nodes and 579 edges, we find high diagnostic accuracy in scenarios where one to three 
faults, both in components and sensors, were inserted. The time taken to compute the most probable 
explanation using arithmetic circuits has a small mean of 0.2625 milliseconds and standard deviation of 
0.2028 milliseconds. In experiments with data from ADAPT we also show that arithmetic circuit evalua- 
tion substantially outperforms joint tree propagation and variable elimination, two alternative algorithms 
for diagnosis using Bayesian network inference. 


1 Introduction 

In this work, we apply probabilistic model-based diagnosis techniques to a real-world electrical power system 
(EPS), namely the Advanced Diagnostic and Prognostic Testbed (ADAPT) [1]. A Bayesian network (BN)
model of the ADAPT electrical power system plays a central role. This ADAPT BN represents health of 
sensors and subsystem components explicitly, and is auto-generated from a high-level system model of the 
ADAPT EPS. This BN is compiled, off-line, into an arithmetic circuit which is then evaluated on-line. We 
believe that this ADAPT case study clearly demonstrates how arithmetic circuits offer a scalable inference 
technique with potential for real-time evaluation in aircraft and spacecraft. 

BNs have recently gained great popularity as an approach to representing multi-variate probability dis- 
tributions [2]. BNs play a central role in a wide range of automated reasoning applications, for example in 
model-based diagnosis [3, 4, 5], medical diagnosis [6, 7], natural language understanding [8], probabilistic 
risk analysis [9], intelligent data analysis [10, 11], and error correction coding [12, 13]. 

A major point in this work is our case study on an electrical power system, ADAPT, that is representative 
of those found in aerospace vehicles. One of our main contributions is the integration of different techniques, 
both existing and novel, in order to address a real-world problem, thereby obtaining an approach that scales
up to handle real-world challenges in probabilistic model-based diagnosis.

Several aspects of this work make it different from previous efforts that utilize Bayesian networks for 
EPS diagnosis [4, 5]: A first contribution is our expression of EPS components and structure, using a 
novel high-level language, coupled with auto-generation of Bayesian networks from models expressed in this 
language. This approach supports the iterative development of probabilistic diagnostic models for large EPSs, 
including diagnostic system models that would be extremely tedious to hand-construct even for Bayesian 
network experts. The benefit of this approach to developers and engineers who are not, or are only vaguely,
familiar with Bayesian networks appears to be even greater.

It is important to achieve real-time performance in many EPS health monitoring applications in aerospace 
[14, 15]. As a second contribution, we highlight our compilation approach to probabilistic
diagnosis, specifically the off-line compilation of Bayesian networks to arithmetic circuits [16, 17], which are
then used for on-line diagnosis. Our experiments show that this approach provides high-quality
diagnostic results on ADAPT scenarios, and that its performance is substantially better
than that of alternative probabilistic inference algorithms, specifically variable elimination and clique (or join) tree
propagation. 

We believe that this work is significant for the following reasons. Electrical power systems are of crucial 
importance in aerospace as well as in numerous other areas of society [18, 1]. Our results in this work provide
an argument for the feasibility of probabilistic, model-based diagnosis of EPSs. In particular, we address two 
real-world challenges of the model-based approach, especially as systems grow, namely (i) the difficulty of 
model construction (the modelling challenge) and (ii) scalability or performance concerns due to the inherent 
difficulty of probabilistic reasoning including diagnosis [19, 20, 21] (the real-time reasoning challenge). In the 
area of model construction, we provide automated generation of a Bayesian network from an EPS schematic 
and automated construction of an arithmetic circuit from the Bayesian network. This arithmetic circuit, 
which typically is large but has a simple semantics, supports real-time diagnosis on-line in the following
two ways. First, it results in more predictable times. Second, it results in much faster inference. These 
two benefits are important to us, given the real-time requirements of aircraft and spacecraft avionics [15]. 
More generally, our approach to integrating a diagnostic system into the real-time setting is an example of 
“embedding AI into a real-time system” [14]. This addresses the scalability concerns mentioned above. 

The rest of this work is structured as follows. After briefly introducing fundamentals of Bayesian networks 
and arithmetic circuits in Section 2, we discuss electrical power systems and ADAPT in particular in Section
3. We then present, using a simple EPS example, the high level specification language and model in Section
4. In Section 5 we consider the high level specification of a large portion of ADAPT. We then discuss
Bayesian network modelling and auto-generation (Section 6 and Section 7 respectively), and compilation 
to arithmetic circuits (Section 8). We note that the latter two types of models (Bayesian networks and 
arithmetic circuits) are both auto-generated in our approach, thus reducing the amount of manual effort 
required. Finally, we report on experimental results in Section 9, both on real-world and synthetic data, 
before concluding in Section 10. 

2 Preliminaries 

We now briefly present the underlying formalisms of our probabilistic model-based reasoning approach: 
Bayesian networks and arithmetic circuits. 

2.1 Bayesian Networks 

Bayesian networks (BNs) represent multivariate probability distributions and are used for reasoning and 
learning under uncertainty [2]. Probability theory and graph theory form the basis of BNs: roughly speaking,
random variables are represented as nodes in a directed acyclic graph (DAG), while conditional dependencies 
are represented as graph edges. A key point is that a BN, whose graph structure often reflects a domain’s 
causal structure, is a compact representation of a joint probability table if its graph is relatively sparse. 
Both discrete and continuous random variables can be represented in BNs; our main emphasis in this work 
is on BNs with discrete random variables. Each discrete random variable (or node) $X$ has a finite number
of states $\{x_1, \ldots, x_m\}$ and is parameterized by a conditional probability table (CPT).

Let $\mathbf{X}$ be the BN nodes, $\mathbf{E} \subseteq \mathbf{X}$ the evidence nodes, and $\mathbf{e}$ the evidence. Different probabilistic queries
can now be formulated (and supported by different algorithms, as we will further discuss below). These
queries assume that all nodes in $\mathbf{E}$ are clamped to values $\mathbf{e}$. Computation of the most probable explanation
(MPE) amounts to finding a most probable explanation over the remaining nodes $\mathbf{R} = \mathbf{X} - \mathbf{E}$, or $\mathrm{MPE}(\mathbf{e})$.
Computation of marginals (or beliefs) amounts to inferring the posterior probabilities over one or more
query nodes $\mathbf{Q} \subseteq \mathbf{R}$, specifically $\mathrm{BEL}(Q, \mathbf{e})$ where $Q \in \mathbf{Q}$. Marginals are used to compute most likely values
(MLVs) simply by picking, in $\mathrm{BEL}(Q, \mathbf{e})$, a most likely state. Computation of the maximum a posteriori
probability (MAP) generalizes MPE computation and finds a most probable instantiation over nodes $\mathbf{Q} \subseteq \mathbf{R}$,
$\mathrm{MAP}(\mathbf{Q}, \mathbf{e})$. Finally, we note that MAP can be approximated using MPE and MLV, and we will denote this
using $\mathrm{MAP}_{\mathrm{MPE}}(\mathbf{Q}, \mathbf{e})$ and $\mathrm{MAP}_{\mathrm{MLV}}(\mathbf{Q}, \mathbf{e})$ respectively. $\mathrm{MAP}_{\mathrm{MPE}}(\mathbf{Q}, \mathbf{e})$ is the result of disregarding, in $\mathrm{MPE}(\mathbf{e})$, the
nodes in $\mathbf{R}$ not in $\mathbf{Q}$, and $\mathrm{MAP}_{\mathrm{MLV}}(\mathbf{Q}, \mathbf{e})$ is the result of aggregating $\mathrm{MLV}(Q, \mathbf{e})$ of all $Q \in \mathbf{Q}$. These two
approximations are of interest because of the greater computational complexity of MAP [21] compared to
MPE and marginals [19, 20].
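To make the queries above concrete, the following is a minimal sketch that answers MPE, BEL, and MLV queries by brute-force enumeration of the joint distribution. The two-node network A -> B and all of its probabilities are hypothetical and chosen only for illustration; enumeration is exponential in the number of nodes and is not the inference method used in this work.

```python
from itertools import product

# Hypothetical two-node network A -> B (not taken from the paper).
STATES = {"A": ("a1", "a2"), "B": ("b1", "b2")}
PR_A = {"a1": 0.9, "a2": 0.1}
PR_B_GIVEN_A = {("b1", "a1"): 0.8, ("b2", "a1"): 0.2,
                ("b1", "a2"): 0.3, ("b2", "a2"): 0.7}

def joint(a, b):
    """Probability of one row of the joint distribution."""
    return PR_A[a] * PR_B_GIVEN_A[(b, a)]

def consistent_rows(evidence):
    """Yield (row, probability) for every joint row that agrees with the evidence."""
    for a, b in product(STATES["A"], STATES["B"]):
        row = {"A": a, "B": b}
        if all(row[var] == state for var, state in evidence.items()):
            yield row, joint(a, b)

def mpe(evidence):
    """MPE(e): a most probable complete instantiation consistent with e."""
    return max(consistent_rows(evidence), key=lambda row_p: row_p[1])

def bel(query, evidence):
    """BEL(Q, e): posterior marginal of a single query node."""
    totals = {state: 0.0 for state in STATES[query]}
    for row, p in consistent_rows(evidence):
        totals[row[query]] += p
    z = sum(totals.values())  # equals Pr(e)
    return {state: p / z for state, p in totals.items()}

def mlv(query, evidence):
    """MLV(Q, e): a most likely state of the query node."""
    belief = bel(query, evidence)
    return max(belief, key=belief.get)

best_row, best_p = mpe({"B": "b2"})
posterior_a = bel("A", {"B": "b2"})
likely_a = mlv("A", {"B": "b2"})
```

With several query nodes, the two MAP approximations described above would take the query-node part of `mpe(...)` or collect `mlv(...)` for each query node, respectively.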

Different BN inference algorithms can be used to perform the above computations. We distinguish 
between exact and inexact algorithms. Exact algorithms include join tree propagation [22, 23, 24, 25], 
conditioning [26, 27], variable elimination [28, 29, 30], and arithmetic circuit evaluation [16, 17]. Inexact 
algorithms, and in particular stochastic local search algorithms, have been used to compute MPEs [31, 32, 
33, 34, 35] as well as MAPs [36, 21] in Bayesian networks. 

In resource-bounded systems, including real-time avionics systems, there is a strong need to align the 
resource consumption of diagnostic computation with resource bounds [14, 15]. Consequently, the compilation
approach (join tree propagation and arithmetic circuit evaluation) is attractive in resource-bounded
systems. The main advantage of compilation is that a significant amount of the work required for inference 
is performed once offline; this effort is then amortized over many online queries. Online inference is typically 
much faster when using a compilation approach, and the inference times often have much smaller variance. 
In this work we emphasize compilation into arithmetic circuits, which we present next. 

2.2 Arithmetic Circuits 

Arithmetic circuits (ACs), as discussed in [37, 16], are here used to perform probabilistic inference. The 
compilation from BNs to ACs is based on the following connection between BNs and multi-linear functions. 
With each Bayesian network, we associate a corresponding multi-linear function (MLF) that computes the 
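probability of evidence. For example, the network in Figure 1, in which variables A and B are Boolean and
C has three values, induces the following MLF:

$$
\begin{aligned}
&\lambda_{a_1}\lambda_{b_1}\lambda_{c_1}\theta_{a_1}\theta_{b_1}\theta_{c_1|a_1,b_1}
 + \lambda_{a_1}\lambda_{b_1}\lambda_{c_2}\theta_{a_1}\theta_{b_1}\theta_{c_2|a_1,b_1} + \cdots \\
&\quad + \lambda_{a_2}\lambda_{b_2}\lambda_{c_2}\theta_{a_2}\theta_{b_2}\theta_{c_2|a_2,b_2}
 + \lambda_{a_2}\lambda_{b_2}\lambda_{c_3}\theta_{a_2}\theta_{b_2}\theta_{c_3|a_2,b_2}.
\end{aligned}
$$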

The terms in the MLF are in one-to-one correspondence with the rows of the network’s joint distribution. 
Assume that all indicator variables $\lambda_x$ have value 1 and all parameter variables $\theta_{x|\mathbf{u}}$ have value $\Pr(x \mid \mathbf{u})$.
Each term will then be a product of probabilities which evaluates to the probability of the corresponding 
row from the joint. The MLF will add all probabilities from the joint, for a sum of 1.0. To compute the 
probability Pr(e) of evidence e, we need a way to exclude certain terms from the sum. This removal of terms 
is accomplished by carefully setting certain indicators to 0 instead of 1, according to the evidence. 
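As an illustration of how indicators select terms of the MLF, the sketch below evaluates the MLF of a network with Boolean A and B and a three-valued C term by term. It assumes that A and B are the parents of C, as the parameters of the form θ_{c|a,b} suggest; all numerical parameter values are hypothetical. Enumerating terms like this is exponential in the number of variables, which is exactly what compilation to an arithmetic circuit is meant to avoid.

```python
from itertools import product

# Hypothetical parameters (not from the paper) for the network A -> C <- B.
theta_a = {"a1": 0.6, "a2": 0.4}
theta_b = {"b1": 0.7, "b2": 0.3}
theta_c = {  # Pr(c | a, b); every row sums to 1
    ("a1", "b1"): {"c1": 0.5, "c2": 0.3, "c3": 0.2},
    ("a1", "b2"): {"c1": 0.1, "c2": 0.6, "c3": 0.3},
    ("a2", "b1"): {"c1": 0.2, "c2": 0.2, "c3": 0.6},
    ("a2", "b2"): {"c1": 0.3, "c2": 0.3, "c3": 0.4},
}
C_STATES = ("c1", "c2", "c3")

def indicators(evidence):
    """Indicator lambda_x is 0 if state x contradicts the evidence, else 1."""
    lam = {}
    for var, states in (("A", theta_a), ("B", theta_b), ("C", C_STATES)):
        for state in states:
            lam[state] = 1.0 if evidence.get(var, state) == state else 0.0
    return lam

def mlf(evidence):
    """Sum over all joint rows of (product of indicators) * (product of parameters)."""
    lam = indicators(evidence)
    total = 0.0
    for a, b, c in product(theta_a, theta_b, C_STATES):
        total += (lam[a] * lam[b] * lam[c]
                  * theta_a[a] * theta_b[b] * theta_c[(a, b)][c])
    return total

pr_no_evidence = mlf({})       # all indicators 1: sums the whole joint
pr_c2 = mlf({"C": "c2"})       # indicators for c1 and c3 set to 0
```

With no evidence every indicator is 1 and the sum covers the whole joint; with evidence C = c2 only the four rows containing c2 survive.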

The fact that a network’s MLF computes the probability of evidence is interesting, but the network MLF 
has exponential size. However, if we can factor the MLF into something small enough to fit within memory, 
then we can compute Pr(e) in time that is linear in the size of the factorization. The factorization will take 
the form of an AC, which is a rooted DAG, where an internal node represents the sum or product of its 
children, and a leaf represents a constant or variable. In this context, those variables will be indicator and 
parameter variables. An example AC is depicted in Figure 1. We refer to this process of producing an AC 
from a Bayesian network as compiling the network. While a BN is more compact than an AC, for example as 
seen in Figure 1, there are in fact several advantages associated with using an AC for probabilistic inference, 
as we will discuss shortly. 

Once we have an AC for a network, we can compute Pr(e) for given evidence e by assigning appropriate 
values to leaves and then computing a value for each internal node in bottom-up fashion. The value for 
the root is then the answer to the query. We can also compute answers to many other queries (a posterior 
marginal for each network variable, a posterior marginal for each network family, etc.) by performing a 
second downward pass [16] analogous to the outward pass of the jointree algorithm. Hence, many queries 
can be computed simultaneously in time linear in the size of the AC. MPE(e) may be computed in a 
similar manner, by using maximization nodes instead of addition nodes in the AC. Another main point is 
that the upward and downward passes may then be repeated for as many evidence sets as desired, without
recompiling. Performing inference using an AC is therefore divided into two phases: an offline phase, which
compiles the network into an AC and is run once, and an online phase, which answers many queries each
time it is invoked, and which may be invoked multiple times.

Figure 1: A Bayesian network (with nodes A, B, and C) and a corresponding arithmetic circuit compiled
from the Bayesian network.
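The bottom-up evaluation of an AC can be sketched as follows. This is a minimal illustration, not the compiler or evaluator used in this work: the circuit is hand-built for a hypothetical two-node network A -> B with made-up parameters, the `Node` class and function names are assumptions, and only the upward pass is shown (the downward pass for marginals is omitted). Replacing addition by maximization in the same circuit yields the probability of a most probable explanation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass(eq=False)
class Node:
    op: str                                   # "leaf", "+", or "*"
    children: List["Node"] = field(default_factory=list)
    name: str = ""                            # leaf label, e.g. "lam_a1"

def evaluate(root: Node, leaf_values: Dict[str, float], maximize: bool = False) -> float:
    """Upward pass; each node is computed once, so time is linear in circuit size."""
    cache: Dict[int, float] = {}

    def visit(node: Node) -> float:
        if id(node) in cache:
            return cache[id(node)]
        if node.op == "leaf":
            value = leaf_values[node.name]
        else:
            values = [visit(child) for child in node.children]
            if node.op == "*":
                value = 1.0
                for v in values:
                    value *= v
            else:  # "+" node: sum for Pr(e), max for the MPE probability
                value = max(values) if maximize else sum(values)
        cache[id(node)] = value
        return value

    return visit(root)

def leaf(name: str) -> Node:
    return Node("leaf", name=name)

# Circuit for f = sum_a lam_a * th_a * (sum_b lam_b * th_{b|a}); lam_b leaves are shared.
lam = {s: leaf("lam_" + s) for s in ("a1", "a2", "b1", "b2")}

def branch(a: str) -> Node:
    terms = [Node("*", [lam[b], leaf(f"th_{b}|{a}")]) for b in ("b1", "b2")]
    return Node("*", [lam[a], leaf("th_" + a), Node("+", terms)])

root = Node("+", [branch("a1"), branch("a2")])

# Hypothetical parameters.
params = {"th_a1": 0.9, "th_a2": 0.1,
          "th_b1|a1": 0.8, "th_b2|a1": 0.2,
          "th_b1|a2": 0.3, "th_b2|a2": 0.7}

def leaf_values(evidence: Dict[str, str]) -> Dict[str, float]:
    """Parameters plus indicators; evidence maps 'a' or 'b' to an observed state."""
    values = dict(params)
    for state in lam:
        var = state[0]
        values["lam_" + state] = 1.0 if evidence.get(var, state) == state else 0.0
    return values

pr_b2 = evaluate(root, leaf_values({"b": "b2"}))                   # Pr(B = b2)
mpe_b2 = evaluate(root, leaf_values({"b": "b2"}), maximize=True)   # MPE probability
```

The circuit is built once and then evaluated for as many evidence sets as needed by changing only the indicator values, which mirrors the offline and online phases described above.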

We close this section by noting the close relationship between the jointree algorithm [22, 24] and ACs, since 
the data structures involved in this algorithm embed an AC in a very precise sense [38]. Other compilation 
algorithms have been developed based on tabular elimination [37], weighted model counting [39], and ADD 
elimination [17]. These algorithms can have an exponential advantage over jointree by exploiting structure 
in the parameters of the Bayesian network [40]. 

3 Electrical Power Systems 

We discuss the importance of electrical power systems (EPSs) in aerospace applications and describe an EPS 
testbed, ADAPT, that is the subject of this case study. 

3.1 Electrical Power System Challenges in Aerospace and at NASA 

Electrical power systems (EPSs) play an essential role in aerospace vehicles [18, 1]. The electrical power 
system may be thought of as the circulatory system of an aerospace vehicle. In the human body, the circula- 
tory system delivers the necessary nutrients to the constituent elements. Similarly, the EPS delivers energy 
to subsystems in order to power required vehicle functions such as life support, propulsion, communications, 
guidance, navigation, and control. Loss of electrical power to these and similar subsystems can result in 
severe repercussions for the vehicle, personnel, or mission. 

Unfortunately, electrical power systems have been implicated in several aerospace vehicle incidents, acci- 
dents and mishaps. In one accident, the left Power Conversion and Distribution Unit (PCDU) on a Boeing 
717 failed, resulting in the loss of the left AC and DC busses. The most likely cause was determined to be 
the failure of a transient suppression diode, which allowed AC current to contaminate the DC circuits of the 
PCDU. Shorted diodes were found on one of the circuit boards and several circuit cards showed evidence 
of heavy sooting. One serious and several minor injuries were sustained by passengers using the emergency 
evacuation slides after the pilot ordered the crew and passengers to evacuate the airplane subsequent to 
smelling electrical smoke (see NTSB report number NYC03FA067). In another incident involving the PCDU 
of a Boeing 717, a tantalum capacitor and a permanent magnet generator input transformer failed, resulting 
in smoke in the cabin and an emergency landing and evacuation (NTSB report ATL04IA085). 

The Electric Propulsion Space Experiment (ESEX) mission, launched and operated in early 1999, ended 
prematurely when the spacecraft experienced a catastrophic battery failure. During the first battery charge, 
a charging circuit instability occurred due to high internal impedance. After the eighth firing, large battery 
voltage fluctuations were observed over a 24 hour period, eventually stabilizing at a low voltage. The failure 
was most likely the result of electrolyte leakage which caused a short circuit to the battery case. Subsequent
discharging through the short circuit led to increases in battery temperature and pressure due to gases
released from electrolyte decomposition. This resulted in a breach of the battery case, entry of super-heated
gas into the flight unit, and eventual venting into space [41]. On January 14, 2005, an Intelsat-operated
communications satellite suffered a total loss after a sudden and unexpected electrical power system anomaly.
The failure of Intelsat 804’s high voltage power system was likely the result of a high current event in the
battery circuitry triggered by an electrostatic discharge (see http://sat-nd.com/failures/index.html/).
A battery failure also occurred on the Mars Global Surveyor, which last communicated with Earth on
November 2, 2006. In this case, a software error oriented the spacecraft to an angle that exposed it to too
much sunlight. This caused the battery to overheat and ultimately led to the depletion of both batteries
(see http://mpfwww.jpl.nasa.gov/mgs/newsroom/20070413a.html). New challenges will be presented by
advanced battery chemistries which require careful supervision.

Figure 2: The ADAPT electrical power system lab at the NASA Ames Research Center.

Figure 3: Schematic of a portion of the ADAPT testbed, showing one of three power storage parts (for
Battery 1, left) and one of two load banks (Load bank 1, right). Detailed information about loads is given
in Table 2. Figure 4 shows pictures of loads as well as how they are connected to load banks.

These are just a few examples of the faults that can arise in EPSs. In addition to failures in PCDUs and 
batteries, failures in wiring, electronics and electromechanical devices are also well-known. Much attention 
has recently been given to wiring in commercial aircraft, with the cancellation of hundreds of flights because 
of concerns about the bundling and routing of wires. Power semiconductor electronics and electromechanical 
devices will play an ever greater role as aerospace vehicles move towards all electric designs. Given the 
prevalence and importance of EPSs, it is vital to develop effective health management techniques for real- 
time aerospace operations, including at NASA. 

3.2 ADAPT: Advanced Diagnostic and Prognostic Testbed 

Figure 2 shows the Advanced Diagnostics and Prognostics Testbed (ADAPT) lab; see also
http://ti.arc.nasa.gov/adapt/ and [1]. ADAPT, which has capabilities for power generation, power storage, and power
distribution, is a fully operational electrical power system that is representative of such systems in aircraft 
and spacecraft. Figure 3 presents a schematic with a representative battery and load bank from ADAPT. 
Figure 4 shows ADAPT loads. ADAPT is configured to achieve fault-tolerance, and contains three batteries 
and two load banks. One battery can provide power to at most one load bank at any given time, and each 
battery is wired such that it can power any one of the two load banks. In Figure 3, for example, for Battery 
1 to power Load bank 1, relay EY141 must be closed. For Battery 1 to power Load bank 2, on the other 
hand, relay EY144 must be closed. Relays EY141 and EY144 cannot both be closed at the same time. 

Different types of components and sensors used in ADAPT are presented in Table 1. Relays, which are 
commanded to close and open in order to control power, have prefix EY (in Figure 3) and health modes 
healthy, stuck or failed open, and stuck or failed closed. A position sensor, also presented in Table 1, reports
on the status of a relay. As concrete examples, consider in Figure 3 relay EY170 that controls power to load
L1A; it also has a position (or touch) sensor ESH170. In our application, each of EY170 and ESH170 is
represented by random variables, including health status random variables with states as represented in Table
1. For example, EY170’s health random variable has states {healthy, stuckOpen, stuckClosed}. Upstream
of relay EY170 is a current sensor IT167; the states of its health variable are {healthy, readCurrentLo,
readCurrentHi} as shown in the table. Further information regarding our probabilistic modelling of EPS
components and structure is provided in Section 6 and Section 7.

Figure 4: The loads in ADAPT along with their sensor and connection information. For example, consider
the large fan FAN1 shown at the bottom left in the picture. FAN1’s labels say that the fan is connected to
L1B in the schematic (see Figure 3) and has one sensor ST515 (see Table 2).

There are two load banks in ADAPT, each with an AC part and a DC part. Load bank 2 is very similar
to Load bank 1; the loads are just plugged into different locations. Each load is connected at a fixed place in
the power distribution unit. In other words, for the purposes of this work there is no ambiguity as to which 
“power outlet” a load is “plugged into”. At this time, there are mostly AC loads in ADAPT; see Table 2 
as well as Figure 4. We have included the possibility of DC loads (currently there are 2 DC loads, one for 
each load bank). To convert DC power from the batteries into AC power used by the AC loads, ADAPT 
has two inverters, one per load bank. A failed inverter breaks power transmission to the AC loads; see the 
stuckOpen failure mode in Table 1. 


4 High Level Models 

Our approach to probabilistic model-based diagnosis involves four stages. In the first stage, we describe
the EPS using a high-level modeling language. The main purpose of this language is to make specifying
the EPS easy and less error-prone. In the second stage, we apply a program to automatically convert the 
high-level specification into a Bayesian network. Putting the model into the form of a Bayesian network 
allows us to leverage a large body of existing work on Bayesian network inference techniques. In the third 
stage, we compile the Bayesian network into an arithmetic circuit. This stage represents the application of a 
specific technique (arithmetic circuits) for performing inference in Bayesian networks, which in the resource- 
bounded, real-time context has significant advantages over other techniques [14, 15]. All stages up to this 
point have taken place offline, before the EPS is put into actual use. The fourth stage involves applying 
algorithms to the arithmetic circuit to perform inference online, when the EPS is deployed in an aircraft or 
spacecraft. By this time, as much computational effort as possible has been performed offline, leaving much 
less computation to be performed online. In this and the next sections, we provide more detail on each stage, 
beginning in this section with the novel high-level specification language. 

A high-level specification is a sequence of statements. The syntax of our high level specification language
is given in Table 3. In Table 3, <name> is an identifier and <p> is a probability. Each statement defines
a component, which can either be a source (battery), a basic component, a sensor, or a sink (load). For
brevity, we do not describe here some statements defining more complicated sensors. The general idea is that
electricity flows from sources through basic components to sinks. Along the way, sensors attempt to monitor
what is happening, and various failures can occur at each component. For each component, we define its
name, its type (e.g., source, load, breaker, relay, sensorCurrent, sensorVoltage), the probability that
the component will fail,¹ and a set of neighboring components. For a source, the set of neighbors is empty; for
a basic component or a sink, we list all upstream neighbors, that is, neighbors that lie between the component and
a source of electricity; for a sensor, we list only the component to which the sensor is attached. These sets
of neighbors serve to define the topology of an EPS; in fact a directed acyclic graph is induced as discussed
in Section 7.

Part             Prefix  Mode (Healthy/Faulty)     States
---------------  ------  ------------------------  -------------
Battery          BATT    Healthy                   healthy
                         Voltage failure or drain  stuckDisabled
Circuit breaker  ISH     Healthy                   healthy
                         Stuck or failed open      stuckOpen
                         Stuck or failed closed    stuckClosed
Inverter         INV     Healthy                   healthy
                         Switched off              stuckOpen
Relay            EY      Healthy                   healthy
                         Stuck or failed open      stuckOpen
                         Stuck or failed closed    stuckClosed
Voltage sensor   E       Healthy                   healthy
                         Reading stuck low         readVoltageLo
                         Reading stuck high        readVoltageHi
Current sensor   IT      Healthy                   healthy
                         Reading stuck low         readCurrentLo
                         Reading stuck high        readCurrentHi
Position sensor  ISH     Healthy                   healthy
                         Reading stuck open        stuckOpen
                         Reading stuck closed      stuckClosed

Table 1: Different EPS parts along with their modes and the corresponding states of the health node for the
part.

ID   Type  Relay  Description        Load ID  Measurements (Sensor IDs)
---  ----  -----  -----------------  -------  ------------------------------------------------------------
L1A  AC    EY170  3 light bulbs      LGT6     Temperatures (TE500, TE501, TE502) and Light sensor (LT500)
L1B  AC    EY171  Big fan            FAN1     RPM (ST515)
L1C  AC    EY172  Small fan          FAN3     None
L1D  AC    EY173  1 light bulb       LGT8     None
L1E  AC    EY174  Water pump         PMP2     Flow rate (FT525)
L1F  AC    EY175  1 light bulb       LGT4     Temperature (TE511)
L1G  DC    EY183  Electromechanical  DC1      None
L1H  DC    EY184  None               N/A      N/A

Table 2: Loads and their sensors (where applicable) for Load bank 1 of the ADAPT electrical power system.

Figure 5 depicts a very simple example of an electrical power system, in which there is only one battery
(source) and only one load (sink). Electricity flows from the battery batt to the load Id through a relay
rly. There is a current sensor curSens attached to the wire wire1 between the source and the relay and
a touch sensor touchSens attached to the relay itself. There is also a voltage sensor voltSens attached to the second
wire, wire2. Table 4 contains a description of this small EPS in our specification language. The third line,
for example, defines a current sensor curSens with failure probability 0.0003 and attached to component
wire1 (which happens to be a wire defined in the second line); the fourth line defines a relay rly with failure
probability 0.0003 and receiving electricity from component wire1; and the fifth line defines a touch (or
position) sensor touchSens with failure probability 0.0002 and attached to component rly.

¹ As described, all failures for a given component have equal probability, but the syntax can easily be extended to assign
differing probabilities to different kinds of failures.

<eps>       ::= <component>+
<component> ::= (<source> | <basic> | <sensor> | <sink>)
<source>    ::= <name> "source" <p>
<basic>     ::= <name> <btype> <p> <name>+
<sensor>    ::= <name> <stype> <p> <name>
<sink>      ::= <name> "sink" <p> <name>+
<btype>     ::= "load" | "wire" | "inverter" | "breaker" | "relay"
<stype>     ::= "sensorCurrent" | "sensorVoltage" | "sensorTouch"

Table 3: The syntax of our novel specification language for electrical power systems.

Figure 5: A small electrical power system; it is described using our specification language in Table 4.
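The way the neighbor sets induce a directed acyclic graph can be sketched with a small parser for the example EPS. This is an illustrative sketch only, not the authors' tool: it assumes statements of the form `name : type : probability : neighbors ;`, with colon-separated fields and a terminating semicolon, and the names `Component` and `parse_spec` are hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to order components so that every component appears after its upstream neighbors.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, NamedTuple

class Component(NamedTuple):
    name: str
    ctype: str
    p_fail: float
    neighbors: List[str]   # upstream neighbors, or the component a sensor is attached to

# The small EPS of Figure 5, written in the assumed colon/semicolon-delimited form.
SPEC = """
batt      : source        : 0.0001 ;
wire1     : wire          : 0.0000 : batt ;
curSens   : sensorCurrent : 0.0003 : wire1 ;
rly       : relay         : 0.0003 : wire1 ;
touchSens : sensorTouch   : 0.0002 : rly ;
wire2     : wire          : 0.0000 : rly ;
voltSens  : sensorVoltage : 0.0002 : wire2 ;
Id        : sink          : 0.0001 : wire2 ;
"""

def parse_spec(text: str) -> Dict[str, Component]:
    """Parse statements into components keyed by name."""
    components: Dict[str, Component] = {}
    for statement in text.split(";"):
        fields = [f.strip() for f in statement.split(":")]
        if fields == [""]:
            continue  # whitespace after the last semicolon
        name, ctype, p_fail = fields[0], fields[1], float(fields[2])
        neighbors = fields[3].split() if len(fields) > 3 else []
        components[name] = Component(name, ctype, p_fail, neighbors)
    return components

components = parse_spec(SPEC)

# Map each component to its predecessors; static_order() raises CycleError
# if the neighbor sets do not form a directed acyclic graph.
graph = {c.name: c.neighbors for c in components.values()}
order = list(TopologicalSorter(graph).static_order())
```

An ordering in which upstream components come first is a natural order in which to generate Bayesian network nodes, since a component's parents would then already exist when it is processed.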

The main advantage of the high-level specification is ease-of-use. One can specify a model by listing 
which components exist in the system, and for each, its type, failure probability, and neighbors. All of this 
information can be obtained directly from schematics and hardware manuals. Consequently, the modeling 
task at the specification language level does not require guesswork or any knowledge of Bayesian networks 
or arithmetic circuits. 

Components differ from each other in some ways that are not represented explicitly in the specification 
language, because the information can be inferred from the component’s type. For example, some component 
types, such as a circuit-breaker, accept a command to open or close, whereas some, such as a wire, do not. 
Similarly, different components may suffer different types of failures as presented in Table 1. For example, 
a wire can only fail in a stuck-open state, whereas a circuit breaker can be stuck-open or stuck-closed.²
This information is added during the BN auto-generation stage (see Section 7), reflecting our BN modelling 
approach, which is what we discuss next. 
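The type-dependent information described above can be pictured as a lookup from component type to health states, following Table 1 and the text. The sketch below is an assumption-laden illustration, not the auto-generation code: the dictionary and function names are hypothetical, sinks are omitted because Table 1 does not list their states, and the prior assumes that the single failure probability in a statement is split equally among the failure states, which is one reading of the equal-probability remark in the footnote to the specification language.

```python
# Health states per component type, following Table 1 and the text (a wire can
# only fail stuck-open). "source" corresponds to a battery.
HEALTH_STATES = {
    "source":        ["healthy", "stuckDisabled"],
    "wire":          ["healthy", "stuckOpen"],
    "inverter":      ["healthy", "stuckOpen"],
    "breaker":       ["healthy", "stuckOpen", "stuckClosed"],
    "relay":         ["healthy", "stuckOpen", "stuckClosed"],
    "sensorVoltage": ["healthy", "readVoltageLo", "readVoltageHi"],
    "sensorCurrent": ["healthy", "readCurrentLo", "readCurrentHi"],
    "sensorTouch":   ["healthy", "stuckOpen", "stuckClosed"],
}

# Types that accept an open/close command, according to the text.
COMMANDABLE = {"breaker", "relay"}

def health_prior(ctype: str, p_fail: float) -> dict:
    """Prior over health states; ASSUMES p_fail is shared equally by the failure states."""
    states = HEALTH_STATES[ctype]
    failures = states[1:]
    prior = {"healthy": 1.0 - p_fail}
    for state in failures:
        prior[state] = p_fail / len(failures)
    return prior

relay_prior = health_prior("relay", 0.0003)
wire_prior = health_prior("wire", 0.0)
```

Keeping this information in a table keyed by type is what lets the specification itself stay short: a statement names only the type, and the generator fills in the states.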


² Sometimes a subtle distinction between “fail” and “stuck” is made, which we currently do not make in the BN; instead,
for simplicity, we call all failures “stuck”. (For example, “failed open” means that the component was closed and then opened
without any command. “Stuck open”, on the other hand, means that the component was already open, and failed to close
when commanded to do so.)

batt      : source        : 0.0001
wire1     : wire          : 0.0000  batt ;
curSens   : sensorCurrent : 0.0003  wire1 ;
rly       : relay         : 0.0003  wire1 ;
touchSens : sensorTouch   : 0.0002  rly ;
wire2     : wire          : 0.0000  rly ;
voltSens  : sensorVoltage : 0.0002  wire2 ;
Id        : sink          : 0.0001  wire2 ;

Table 4: A small EPS, shown in Figure 5, described using our specification language.


5 High Level Model for ADAPT

Using our specification language, the ADAPT EPS is described using statements for the following components:
3 sources (batteries), 20 sinks (loads), 16 wires (we only need to describe a wire if it has a sensor
attached or if we want to model failures in wires), 2 inverters, 9 circuit breakers, 25 relays, 17 current and
load sensors, 16 voltage sensors, 33 position (touch) sensors, and 6 more advanced sensors (these advanced
sensors are not described here for the sake of brevity). We now discuss the high level model of ADAPT in some
detail. In particular, we present the part that models the ADAPT schematic shown in Figure 3. ADAPT
loads are depicted in Figure 4.

5.1 Modelling of Batteries and Distribution Network 

The specification language follows the components and structure of an electrical power circuit very closely, 
hence models like the ADAPT model we discuss next can be read directly off schematics. ADAPT contains 
three batteries and distribution networks. Due to their similarities, we only present the specification for one 
of them (Battery 1) here; see also the schematic in Figure 3: 

    battery1 : battery : 0.0005;
    wire135 : wire : 0.0000 : battery1;
    e135 : sensorVoltage : 0.0005 : wire135;
    breaker_ey136_op : breaker : 0.0005 : wire135;
    ish136 : sensorTouch : 0.0005 : breaker_ey136_op;
    wire140 : wire : 0.0000 : breaker_ey136_op;
    e140 : sensorVoltage : 0.0005 : wire140;
    it140 : sensorCurrent : 0.0005 : wire140;
    relay_ey141_cl : relay : 0.0005 : wire140;
    esh141a : sensorTouch : 0.0005 : relay_ey141_cl;
    relay_ey144_cl : relay : 0.0005 : wire140;
    esh144a : sensorTouch : 0.0005 : relay_ey144_cl;
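
To make the statement format concrete, the following is a minimal Python sketch of a parser for statements in the style of the listing above. It assumes the form `name : type : probability [: parent ...];` as read off the listings; the `Component` class and `parse_spec` function are hypothetical names introduced for illustration and are not part of the ADAPT tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One statement of the specification (illustrative structure)."""
    name: str
    kind: str
    failure_prob: float
    parents: list = field(default_factory=list)


def parse_spec(text):
    """Parse 'name : type : prob [: parent ...];' statements.

    ASSUMPTION: this grammar is inferred from the listings, not taken
    from the formal definition of the specification language.
    """
    components = []
    for raw in text.split(";"):
        stmt = raw.strip()
        if not stmt:
            continue
        fields = [f.strip() for f in stmt.split(":")]
        if len(fields) not in (3, 4):
            raise ValueError(f"malformed statement: {stmt!r}")
        parents = fields[3].split() if len(fields) == 4 else []
        components.append(
            Component(fields[0], fields[1], float(fields[2]), parents)
        )
    return components


SPEC = """
battery1 : battery : 0.0005;
wire135 : wire : 0.0000 : battery1;
e135 : sensorVoltage : 0.0005 : wire135;
breaker_ey136_op : breaker : 0.0005 : wire135;
"""

components = parse_spec(SPEC)
for c in components:
    print(c.name, c.kind, c.failure_prob, c.parents)
```

A statement with several parents, such as a wire fed by three relays, yields a `parents` list with three entries because parents are separated by whitespace.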

We now consider the part that connects Battery 1 with Load bank 1: 

    wire142 : wire : 0.0000 : relay_ey141_cl relay_ey241_cl relay_ey341_cl;
    e142 : sensorVoltage : 0.0005 : wire142;
    relay_ey160_cl : relay : 0.0005 : wire142;
    esh160a : sensorTouch : 0.0005 : relay_ey160_cl;
    wire161 : wire : 0.0000 : relay_ey160_cl;
    e161 : sensorVoltage : 0.0005 : wire161;
    it161 : sensorCurrent : 0.0005 : wire161;

5.2 Modelling of Load Banks and Loads 

ADAPT contains two load banks with loads. Due to the similarities between loads, we only discuss the
specification of loads L1A, L1B, L1C, L1E, and L1G for Load bank 1 here. First, we present the part of the
specification that concerns the branch that leads to and includes the DC load L1G:

    breaker_ey180_op : breaker : 0.0005 : wire161;
    ish180 : sensorTouch : 0.0005 : breaker_ey180_op;
    wire181 : wire : 0.0000 : breaker_ey180_op;
    e181 : sensorVoltage : 0.0005 : wire181;
    it181 : sensorCurrent : 0.0005 : wire181;
    relay_ey183_cl : relay : 0.0005 : wire181;
    esh183 : sensorTouch : 0.0005 : relay_ey183_cl;
    load183 : sink : 0.0005 : relay_ey183_cl;

What follows below is the branch containing the Load bank 1 inverter: 

    breaker_ey162_op : breaker : 0.0005 : wire161;
    ish162 : sensorTouch : 0.0005 : breaker_ey162_op;
    inv1 : inverter : 0.0005 : breaker_ey162_op;
    wire165 : wire : 0.0000 : inv1;
    e165 : sensorVoltage : 0.0005 : wire165;
    breaker_ey166_op : breaker : 0.0005 : wire165;
    ish166 : sensorTouch : 0.0005 : breaker_ey166_op;
    wire167 : wire : 0.0000 : breaker_ey166_op;
    e167 : sensorVoltage : 0.0005 : wire167;
    it167 : sensorCurrent : 0.0005 : wire167;

Here are the AC loads L1B, L1C, and L1E:

    relay_ey171_cl : relay : 0.0005 : wire167;
    esh171 : sensorTouch : 0.0005 : relay_ey171_cl;
    load171 : sink : 0.0005 : relay_ey171_cl;
    fan1_speed_st515 : sensorCurrentThree : 0.0005 : load171;
    relay_ey172_cl : relay : 0.0005 : wire167;
    load172 : sink : 0.0005 : relay_ey172_cl;
    esh172 : sensorTouch : 0.0005 : relay_ey172_cl;
    relay_ey174_cl : relay : 0.0005 : wire167;
    esh174 : sensorTouch : 0.0005 : relay_ey174_cl;
    load174 : sink : 0.0005 : relay_ey174_cl;
    pump2_flow_ft525 : sensorCurrentThree : 0.0005 : load174;

We finally consider AC load L1A.

    relay_ey170_cl : relay : 0.0005 : wire167;
    esh170 : sensorTouch : 0.0005 : relay_ey170_cl;
    load170_bulb0 : sink : 0.0005 : relay_ey170_cl;
    box1_light1_temp_te500 : sensorCurrent : 0.0005 : load170_bulb0;
    load170_bulb1 : sink : 0.0005 : relay_ey170_cl;
    box1_light2_temp_te501 : sensorCurrent : 0.0005 : load170_bulb1;
    load170_bulb2 : sink : 0.0005 : relay_ey170_cl;
    box1_light3_temp_te502 : sensorCurrent : 0.0005 : load170_bulb2;
    light1_level_lt500 : sensorCurrentCount : 0.0005 : load170_bulb0 load170_bulb1 load170_bulb2;

There are three light bulbs here, and this is reflected as three loads in the specification, one for each
light bulb: load170_bulb0, load170_bulb1, and load170_bulb2. There are two types of sensors. (i) A
temperature sensor, which is glued to a light bulb. There is a “current” sensor for each of these:
box1_light1_temp_te500, box1_light2_temp_te501, and box1_light3_temp_te502. Clearly, a current
sensor is technically not the same as a temperature sensor, but they both really measure current, and by
using different failure probabilities we account for their differences. (ii) A light sensor, which is a few
inches away from all the light bulbs. This light sensor senses all three light bulbs. It is represented by
another “current” sensor, light1_level_lt500, attached to all three bulbs. What is different from the
temperature sensors, for example, is that this sensor is attached to multiple components and reports the
number of components that are on. Also, there is just a single failure state. In the specification language,
we note how a sensor is attached to components. Sensors are specified very similarly to how normal
components are specified.
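
The behaviour of the count-style light sensor can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the failure state name `stuckLow` is taken from the diagnoses reported for LT500, and the assumption that a stuck sensor reads zero is ours.

```python
def count_sensor_reading(health, bulbs_on):
    """Reading of a sensor attached to several loads.

    A healthy sensor reports how many attached loads are on.
    ASSUMPTION: in its single failure state the sensor reads 0.
    """
    if health == "stuckLow":
        return 0
    return sum(1 for on in bulbs_on if on)


# Three bulbs attached; the first one is off.
print(count_sensor_reading("healthy", [False, True, True]))
print(count_sensor_reading("stuckLow", [True, True, True]))
```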


6 Modelling Electrical Power Systems using Bayesian Networks 

A main contribution in this work is our systematic modelling of EPSs using BNs. BNs provide a probabilistic
semantics for our high-level specification language, and in addition they support efficient inference, including
compilation into arithmetic circuits. In order to discuss our modelling approach as well as the supporting
auto-generation algorithm and software, we partition the set of BN nodes $X$ into subsets $H$, $E$, $C$, $P$,
and $R$ as follows:

• Health nodes ($H$), where $H = H_C \cup H_S$ and $H_C \cap H_S = \emptyset$. Here, $H_C$ (component health
nodes) represent the health of the EPS components and $H_S$ (sensor health nodes) represent the health of
the EPS sensors.

• Evidence nodes ($E$), where $E = E_C \cup E_S$ and $E_C \cap E_S = \emptyset$. Here, $E_C$ (command nodes)
represent the commands to the EPS, while $E_S$ (sensor nodes) represent sensor readings from the EPS.

• Connection nodes ($C$), where $C = C_U \cup C_S$ and $C_U \cap C_S = \emptyset$. Here, $C_U$ (source
connection nodes) represent connection to a source (battery) in an EPS; $C_S$ (sink connection nodes)
represent connection to a sink (load) in an EPS.

• Presence nodes ($P$), where $P = P_C \cup P_V$ and $P_C \cap P_V = \emptyset$. Here, $P_V$ (voltage presence
nodes) represent voltage, similar to water pressure, provided by a source (battery) in an EPS. $P_C$ (current
propagation or presence nodes) represent flow, similar to water flow, of electrical current from a source
(battery) to a sink (load) in an EPS. In our case, there is presence of voltage iff there is a closed
connection to one or more batteries, therefore one may work with either $C_U$ or $P_V$.

• Remaining EPS nodes ($R$): Nodes that are not health, evidence, connection, or presence nodes. If $X$
is the set of all BN nodes, then $R = X - H - E - C - P$.

Table 5: Bayesian network (top) and MPE computation using the Bayesian network (bottom) for the small
EPS specified in Table 4. In the Bayesian network we show both the actual BN node names (Health_batt,
Health_ld, ...) as well as the mathematical notation ($C_h$, $C_d$, ...) used to describe the auto-generation
algorithm.

The above node partitioning is important for several reasons. It allows us to: state different probabilistic 
queries of interest; discuss our EPS modelling approach using BNs (both the topology as well as the individual 
nodes associated with different EPS components); and clearly present the experimental protocol. 

In Section 2 we discussed, given query variables $Q \subseteq X$ and evidence $e$, three probabilistic queries:
$\mathrm{MAP}(Q, e)$, $\mathrm{MAP}_{\mathrm{mpe}}(Q, e)$, and $\mathrm{MAP}_{\mathrm{mlv}}(Q, e)$. By introducing
the above partitioning, we can put $Q = H_C$, $Q = H_S$, or $Q = H$ and obtain a total of nine different
diagnostic queries. As an example, $Q = H_S$ is of interest in sensor validation, where the main focus is on
qualifying and disqualifying sensors [42], for instance voltage sensors, current sensors, fuel sensors, or
altitude sensors. In the rest of this work we emphasize, in the interest of brevity, only $Q = H$ and in
particular $\mathrm{MAP}_{\mathrm{mpe}}(H, e)$.

A key contribution in this work is our modeling of EPSs using Bayesian networks. An EPS presents 
two different but closely related problems, namely a voltage presence problem and a current flow problem. 
Voltage may propagate from a battery towards the loads. For current to flow, there must be voltage present 
and in addition the EPS circuit needs to be closed, which typically happens when an EPS load is turned 
on and all other relays between the load and a battery are also closed. This bidirectional voltage-current 
propagation problem is different from, and more complicated than, the unidirectional flow problem posed by 
digital circuits implementing boolean logic. Such digital electronic circuits have been extensively studied in 
the model-based diagnosis and Bayesian network literature [2]. 

Table 5 provides a simple example of our EPS modelling approach. This BN represents the simple
electrical power system introduced in Figure 5 and in Table 4 (see footnote 3). Here, $H_C$ = {Health_batt,
Health_ld} and $H_S$ = {Health_curSens, Health_voltSens, Health_touchSens}; $E_C$ = {Command_rly} and
$E_S$ = {Sensor_curSens, Sensor_voltSens, Sensor_touchSens}. The topology of the ADAPT BN, which currently
contains over 500 nodes, is analogous to this BN’s topology. A key point in this example is how the
integration of voltage presence nodes ($P_V$ = {Voltage_batt, Voltage_wire1, Voltage_rly, Voltage_wire2}),
sink connection nodes ($C_S$ = {ToSink_wire1, ToSink_rly, ToSink_wire2, ToSink_ld}), and the current flow
node ($P_C$ = {Current_wire1}) helps solve the bidirectional voltage-current propagation problem. Many
nodes, including current flow nodes, can be pruned (and indeed have been here) because they are leaf
nodes and not involved in sensors. Another key point is how sensors, for example the voltage sensor (nodes
Health_voltSens, Sensor_voltSens) and the current sensor (nodes Health_curSens, Sensor_curSens), are
integrated into the overall BN topology.

We now consider inference as illustrated in Table 5. Suppose that $e$ = {Command_rly = cmdClose,
Sensor_curSens = readCurrentLo, Sensor_voltSens = readVoltageHi, Sensor_touchSens = readClosed} (see
Table 5). This gives $\mathrm{MAP}_{\mathrm{mpe}}(H, e)$ = {Health_batt = healthy, Health_ld = healthy,
Health_curSens = stuckCurrentLo, Health_voltSens = healthy}. In words, if the command and sensor readings,
except for Sensor_curSens = readCurrentLo, suggest that power is supplied to the load, then the MPE diagnosis
is that all components and sensors are healthy, except for the current sensor, where Health_curSens =
stuckCurrentLo. It is reassuring that there is agreement between the MPE diagnosis and common sense.
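
The reasoning in this example can be checked with a brute-force sketch that enumerates all health assignments for the small EPS of Table 4. This is not the arithmetic-circuit inference used in this work, and it rests on several assumptions of ours: the circuit behaviour is simplified to three deterministic quantities, the fault-state names for the battery, load, voltage sensor, and touch sensor are modelled on the diagnoses in Table 7, and each failure probability is split equally among a component's fault states. All function names are hypothetical.

```python
from itertools import product

# Failure probabilities copied from the specification in Table 4.
FAILURE_PROB = {
    "Health_batt": 0.0001, "Health_rly": 0.0003, "Health_ld": 0.0001,
    "Health_curSens": 0.0003, "Health_voltSens": 0.0002,
    "Health_touchSens": 0.0002,
}
# ASSUMPTION: fault-state names; only some appear in the paper.
FAULT_STATES = {
    "Health_batt": ["stuckDisabled"],
    "Health_rly": ["stuckOpen", "stuckClosed"],
    "Health_ld": ["stuckDisabled"],
    "Health_curSens": ["stuckCurrentLo", "stuckCurrentHi"],
    "Health_voltSens": ["stuckVoltageLo", "stuckVoltageHi"],
    "Health_touchSens": ["stuckOpen", "stuckClosed"],
}
STUCK_READING = {
    "stuckCurrentLo": "readCurrentLo", "stuckCurrentHi": "readCurrentHi",
    "stuckVoltageLo": "readVoltageLo", "stuckVoltageHi": "readVoltageHi",
    "stuckOpen": "readOpen", "stuckClosed": "readClosed",
}


def prior(var, state):
    """ASSUMPTION: failure mass is split equally among fault states."""
    p_fail = FAILURE_PROB[var]
    if state == "healthy":
        return 1.0 - p_fail
    return p_fail / len(FAULT_STATES[var])


def read(health, truth, lo, hi):
    """Deterministic sensor: report the truth unless the sensor is stuck."""
    if health != "healthy":
        return STUCK_READING[health]
    return hi if truth else lo


def predict(h, command):
    """Sensor readings implied by a health assignment (simplified circuit)."""
    battery_on = h["Health_batt"] == "healthy"
    if h["Health_rly"] == "healthy":
        relay_closed = command == "cmdClose"
    else:
        relay_closed = h["Health_rly"] == "stuckClosed"
    voltage_wire2 = battery_on and relay_closed
    current_wire1 = voltage_wire2 and h["Health_ld"] == "healthy"
    return {
        "Sensor_curSens": read(h["Health_curSens"], current_wire1,
                               "readCurrentLo", "readCurrentHi"),
        "Sensor_voltSens": read(h["Health_voltSens"], voltage_wire2,
                                "readVoltageLo", "readVoltageHi"),
        "Sensor_touchSens": read(h["Health_touchSens"], relay_closed,
                                 "readOpen", "readClosed"),
    }


def most_probable_explanation(command, readings):
    """Enumerate health assignments consistent with the evidence."""
    names = list(FAULT_STATES)
    best, best_p = None, 0.0
    for states in product(*[["healthy"] + FAULT_STATES[n] for n in names]):
        h = dict(zip(names, states))
        if predict(h, command) != readings:
            continue  # deterministic CPTs: likelihood is zero
        p = 1.0
        for n in names:
            p *= prior(n, h[n])
        if p > best_p:
            best, best_p = h, p
    return best, best_p


readings = {"Sensor_curSens": "readCurrentLo",
            "Sensor_voltSens": "readVoltageHi",
            "Sensor_touchSens": "readClosed"}
diagnosis, probability = most_probable_explanation("cmdClose", readings)
print(diagnosis, probability)
```

Under these assumptions the single-fault explanation with the current sensor stuck low has higher joint probability than a disabled load, which agrees with the diagnosis discussed above. Enumeration is exponential in the number of health nodes and is only feasible for toy networks like this one.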

While our modelling approach can be used when manually constructing BNs for EPSs, it is even more 
powerful when automated, and we now turn to how we have formalized it in the form of an auto-generation 
algorithm. 


7 Auto-generation of Bayesian Networks 

In this section, we discuss how a BN is auto-generated from a high-level specification model. This is the
second stage in our approach to probabilistic model-based diagnosis. The conversion runs in a loop, which
processes one component from the specification in each iteration. Before embarking on the discussion of
the auto-generation algorithm, we note that there are fundamentally two ways to go about this. Either the
components in the specification are traversed in a particular order, so that when the edges between a
node and its parents are created, the parent nodes are guaranteed to already exist; or we take two passes,
where the first pass creates the nodes and the second pass adds the edges. For simplicity, we take the
former approach here, and traverse the specification in a certain sequential order. Such a sequential
order is guaranteed to exist under the assumption that the underlying EPS can be described using a directed
acyclic graph (DAG). There is a clear mapping from a high-level specification to a DAG: In each component
statement (see Table 3), the first <name> represents a node, and the <name>+ part represents its parents
(a source statement is the trivial case: it has no parents, since a source is a root node in the DAG).
Under the assumption that there exists such a DAG, there exists a sequential high-level specification,
since it is well-known from graph theory that any DAG can be topologically (or sequentially) sorted.

3 In fact, the BN in Table 5 was auto-generated, as discussed in Section 7, from the specification in Table 4.
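
The ordering argument can be illustrated with Python's standard `graphlib` module (Python 3.9 or later). The sketch below deliberately lists the statements of the small EPS from Table 4 out of order and recovers a sequential order in which every parent precedes its children. It illustrates the graph-theoretic point only and is not a description of how the auto-generation software is implemented.

```python
from graphlib import TopologicalSorter

# Each component maps to its parents; listed in scrambled order on purpose.
parents = {
    "ld": ["wire2"],
    "voltSens": ["wire2"],
    "wire2": ["rly"],
    "touchSens": ["rly"],
    "rly": ["wire1"],
    "curSens": ["wire1"],
    "wire1": ["batt"],
    "batt": [],
}

# static_order() yields every node after all of its predecessors (parents).
order = list(TopologicalSorter(parents).static_order())
position = {name: i for i, name in enumerate(order)}
print(order)
```

If the specification contained a cycle, `static_order()` would raise `graphlib.CycleError`, which corresponds to the case where no sequential specification exists.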

The auto-generation algorithm can now be summarized as follows: We iterate over the components in
the specification and for each generate a set of BN nodes and a set of BN edges. Each time the algorithm
creates a BN node for a component, it places the node into the appropriate set among $H_C$, $H_S$, $E_C$,
$E_S$, $C_U$, $C_S$, $P_C$, $P_V$, and $R$, as we illustrate below.

Sensors are processed somewhat differently from other components, which are fundamentally all handled in
the same way, so we treat sensors separately, after first discussing other components. As an example,
Figure 6(a) depicts the part of a BN corresponding to a relay $C$. For the component $C$, the auto-generation
algorithm generates six nodes in the BN:

• $C_h \in H_C$, with values {healthy, stuckOpen, stuckClosed}, indicates $C$’s health state. $C_h$ has a CPT
set according to $C$’s failure probability as defined in the high-level specification.

• $C_m \in E_C$, with values {cmdOpen, cmdClose}, indicates the command being sent to the relay. This
value will always be known prior to inference, since it is set according to the command being issued to
the relay (typically, it is cmdClose). Therefore, probabilities in this CPT are not important; $C_m$ has
a uniform CPT.

• $C_d \in R$, with values {open, closed}, indicates whether $C$ is currently closed. If $C_h$ = healthy, then
$C_d$ indicates closed iff $C_m$ = cmdClose. Otherwise, if $C_h$ is stuckOpen (stuckClosed), then $C_d$
indicates open (closed).

• $C_r \in C_U$, with values {open, closed}, indicates whether there is a closed path from $C$ to a battery
(source). $C_r$ = closed iff $C_d = \text{closed} \wedge \bigvee_N [N_r = \text{closed}]$, where $N$ iterates
over all of $C$’s upstream neighbors (see footnote 4).

• $C_k \in C_S$, with values {open, closed}, indicates whether there is a closed path from $C$ to a load (sink).
$C_k$ = closed iff $C_d = \text{closed} \wedge \bigvee_N [N_k = \text{closed}]$, where $N$ iterates over all
of $C$’s downstream neighbors.

• $C_c \in P_C$, with values {currentLo, currentHi}, indicates whether current is flowing through $C$. $C_c$ =
currentHi iff $C_r$ = closed and $C_k$ = closed.

For $C_r$ and $C_k$, the disjunction is cascaded to prevent the CPT from becoming too large. This same
template applies to all non-sensor components with a few minor modifications. For example, a source can
set $C_r$ to be equivalent to $C_d$; a sink can set $C_k$ to be equivalent to $C_d$; a wire, which does not
accept commands, will always set $C_m$ to cmdClose (or omit $C_m$ from the model); and different component
types may have different types of failures.
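
Because the CPTs for $C_d$, $C_r$, $C_k$, and $C_c$ are deterministic, the relay template can be written down as plain functions. The following Python sketch mirrors the definitions above; the function names are hypothetical, and the sketch does not reproduce the cascading of the disjunction or the health prior.

```python
def relay_position(health, command):
    """C_d: position of the relay given its health C_h and command C_m."""
    if health == "stuckOpen":
        return "open"
    if health == "stuckClosed":
        return "closed"
    return "closed" if command == "cmdClose" else "open"


def path_closed(own_position, neighbor_paths):
    """C_r (or C_k): closed iff the component itself is closed AND at
    least one upstream (or downstream) neighbor has a closed path."""
    any_neighbor = any(p == "closed" for p in neighbor_paths)
    return "closed" if own_position == "closed" and any_neighbor else "open"


def current(to_source, to_sink):
    """C_c: current flows iff both C_r and C_k are closed."""
    both = to_source == "closed" and to_sink == "closed"
    return "currentHi" if both else "currentLo"


# A healthy relay commanded closed, fed by two upstream neighbors (one open).
c_d = relay_position("healthy", "cmdClose")
c_r = path_closed(c_d, ["open", "closed"])
c_k = path_closed(c_d, ["closed"])
print(c_d, c_r, c_k, current(c_r, c_k))
```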

Figure 6(b) depicts the part of the BN corresponding to a current sensor $S$, which is attached to a node
such as $C_c$ of a component $C$. The auto-generation algorithm creates two nodes in the BN corresponding to
$S$:

• $S_h \in H_S$, with values {healthy, stuckCurrentLo, stuckCurrentHi}, indicates $S$’s health state. $S_h$
has a CPT set according to $S$’s failure probability as defined in the high-level specification.

• $S_s \in E_S$, with values {readCurrentLo, readCurrentHi}, indicates $S$’s two-state discretized sensor
reading. If $S_h$ = healthy, then $S_s$ indicates readCurrentHi iff $C_c$ = currentHi. Otherwise, if $S_h$ is
stuckCurrentLo (stuckCurrentHi), then $S_s$ indicates readCurrentLo (readCurrentHi).

This same template applies to all sensor components (except some more complicated sensors, which are
beyond the scope of this work) with a few minor modifications. For example, different sensors are attached
to different nodes in $C$: current sensors are attached to $C_c$, voltage sensors are attached to $C_r$, while
touch sensors are attached to $C_d$.
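
The sensor template can be sketched in the same way. The Python fragment below builds the deterministic CPT $P(S_s \mid S_h, C_c)$ for a current sensor as a dictionary; it assumes that a healthy sensor reads readCurrentHi exactly when the attached node is currentHi, and the names `sensor_reading` and `cpt` are ours.

```python
from itertools import product

HEALTH = ["healthy", "stuckCurrentLo", "stuckCurrentHi"]
CURRENT = ["currentLo", "currentHi"]
READING = ["readCurrentLo", "readCurrentHi"]


def sensor_reading(health, current):
    """S_s as a deterministic function of S_h and the attached node C_c."""
    if health == "stuckCurrentLo":
        return "readCurrentLo"
    if health == "stuckCurrentHi":
        return "readCurrentHi"
    return "readCurrentHi" if current == "currentHi" else "readCurrentLo"


# CPT entries are 0 or 1 because the relationship is deterministic.
cpt = {
    (h, c, r): 1.0 if sensor_reading(h, c) == r else 0.0
    for h, c, r in product(HEALTH, CURRENT, READING)
}

for (h, c, r), p in cpt.items():
    if p == 1.0:
        print(f"P({r} | {h}, {c}) = 1")
```

A voltage or touch sensor would follow the same pattern with different state names and a different attached node.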

4 A neighbor of $C$ is upstream of $C$ if it is located between $C$ and a source in the high-level
specification. A neighbor of $C$ is downstream of $C$ if it is located between $C$ and a sink in the
high-level specification.

Figure 6: The part of the BN corresponding to (a) a relay and (b) a current sensor.

After the BN generation step discussed above, there is a BN pruning step. Pruning takes place based
on information about query nodes ($H_C$ and $H_S$) as well as about evidence nodes ($E_C$ and $E_S$). It is
well-known that pruning can make inference in Bayesian networks more tractable. A common pruning
technique involves removing leaf nodes that are not part of the evidence ($E_C \cup E_S$) or the query
($H_C$, $H_S$, or $H_C \cup H_S$) [43]. In Table 5, some of the nodes have been pruned compared to
Figure 6(a). Specifically, nodes corresponding to $C_r$, $C_k$, and $C_c$ are pruned in Table 5. In other
words, for the relay shown in Table 5 we have the following correspondence with the non-pruned nodes in
Figure 6(a): $C_h$ = Health_rly, $C_m$ = Command_rly, $C_d$ = Closed_rly. How do we decide to prune $C_r$,
$C_k$, and $C_c$ (referring to Figure 6(a)), but not $S_s$ = Sensor_curSens and $S_h$ = Health_curSens
(referring to Figure 6(b) and Table 5)? Here, $S_s$ = Sensor_curSens is a variable for which we assert
evidence, while $S_h$ = Health_curSens is a query variable. Evidence and query variables are never pruned.
We only prune non-evidence, non-query variables that are leaves, or which become leaves as a result of
other pruning. Consequently, all of $C$’s nodes are pruned except the following: $C_h$ = Health_rly, which
is a query variable; $C_m$ = Command_rly, which is an evidence variable; and $C_d$ = Closed_rly, which is
neither, but cannot be pruned in this case because it has a descendant that is an evidence variable,
namely the touch sensor variable Sensor_touchSens.
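
The pruning rule can be sketched as a small fixed-point loop. The Python example below applies it to an isolated relay fragment in the style of Figure 6(a) with a touch sensor attached; the node names ToSource_rly and Current_rly are hypothetical, and the fragment ignores neighboring components, so it is an illustration of the rule and not a reproduction of Table 5.

```python
def prune(parents, keep):
    """Repeatedly remove leaf nodes that are neither evidence nor query.

    parents: dict mapping each node to the list of its parents.
    keep:    set of evidence and query nodes, which are never pruned.
    """
    parents = {n: set(ps) for n, ps in parents.items()}
    while True:
        has_child = {p for ps in parents.values() for p in ps}
        leaves = [n for n in parents if n not in has_child and n not in keep]
        if not leaves:
            return parents
        for n in leaves:
            del parents[n]  # may turn the parents of n into new leaves


fragment = {
    "Health_rly": [],
    "Command_rly": [],
    "Closed_rly": ["Health_rly", "Command_rly"],
    "ToSource_rly": ["Closed_rly"],           # hypothetical name for C_r
    "ToSink_rly": ["Closed_rly"],
    "Current_rly": ["ToSource_rly", "ToSink_rly"],  # hypothetical, C_c
    "Health_touchSens": [],
    "Sensor_touchSens": ["Closed_rly", "Health_touchSens"],
}
keep = {"Health_rly", "Health_touchSens", "Command_rly", "Sensor_touchSens"}

pruned = prune(fragment, keep)
print(sorted(pruned))
```

The current node is removed first, after which the two connection nodes become leaves and are removed in the next round. Closed_rly survives because the evidence node Sensor_touchSens is its child.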


8 Compilation to Arithmetic Circuits 

We now very briefly summarize the compilation of Bayesian networks to arithmetic circuits (ACs). This
compilation is the third stage in our approach to probabilistic model-based diagnosis. Prior to compilation,
we modify the BN’s CPTs to store pointers to AC nodes rather than numbers. For example, if 0.1 is stored
in a particular slot of some CPT, then this number would be replaced with a pointer to a single AC node
(sink) labeled with 0.1. Also prior to compilation, for each BN variable, we add a new table over just that
variable representing the values of that variable. For example, variable $X$ with values 0 and 1 would generate
a table over $X$ where the first slot contains a pointer to an AC node (sink) labeled with $\lambda_0$ and the
second slot contains a pointer to an AC node (sink) labeled with $\lambda_1$.

After these two preprocessing steps, we run a slightly modified version of standard variable elimination
(VE) [29, 30]. The only difference occurs when the standard version wishes to add or multiply two numbers.
In each of these situations, the standard algorithm will identify two slots $A$ and $B$ in tables, add (multiply)
the two numbers residing there, and store the result back into some slot $C$ of some table. When the modified
algorithm looks into $A$ and $B$, it finds pointers to AC nodes $\alpha$ and $\beta$ rather than numbers. Instead
of performing the arithmetic operation, the modified algorithm creates a new AC node $\gamma$ labeled with “+”
or “*”, makes $\alpha$ and $\beta$ children of $\gamma$, and stores a pointer to $\gamma$ into $C$. Upon completion,
standard VE yields a single table containing a single slot containing a number. The modified algorithm does the
same, except that rather than a number, we will have a pointer to an AC node, which is the root of the compiled
arithmetic circuit.
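
The idea of replacing arithmetic by node construction can be sketched for a two-node network A → B. In the Python example below, `add` and `mul` build circuit nodes instead of computing numbers, and the elimination of B and then A is written out by hand. The class and function names and all CPT numbers are hypothetical; the sketch evaluates the probability of evidence rather than an MPE and does not exploit local structure, so it should not be read as a description of the ACE compiler.

```python
class ACNode:
    """A node of an arithmetic circuit: parameter, indicator, sum, or product."""

    def __init__(self, op, children=(), label=None):
        self.op = op
        self.children = tuple(children)
        self.label = label


def param(value):
    return ACNode("param", label=value)


def indicator(name):
    return ACNode("indicator", label=name)


def add(a, b):
    """Instead of a + b, create a '+' node with children a and b."""
    return ACNode("+", (a, b))


def mul(a, b):
    """Instead of a * b, create a '*' node with children a and b."""
    return ACNode("*", (a, b))


def evaluate(node, indicators):
    """Bottom-up evaluation under a setting of the evidence indicators."""
    if node.op == "param":
        return node.label
    if node.op == "indicator":
        return indicators[node.label]
    left, right = (evaluate(c, indicators) for c in node.children)
    return left + right if node.op == "+" else left * right


# Tables hold pointers to AC nodes, not numbers (illustrative CPT values).
lam_a = [indicator("a0"), indicator("a1")]
lam_b = [indicator("b0"), indicator("b1")]
theta_a = [param(0.9), param(0.1)]
theta_b_given_a = [[param(0.8), param(0.2)], [param(0.3), param(0.7)]]

# Eliminate B: one sum node per value of A.
g = []
for a in (0, 1):
    terms = [mul(lam_b[b], theta_b_given_a[a][b]) for b in (0, 1)]
    g.append(add(terms[0], terms[1]))

# Eliminate A: the single remaining slot holds the root of the circuit.
terms = [mul(mul(lam_a[a], theta_a[a]), g[a]) for a in (0, 1)]
root = add(terms[0], terms[1])

no_evidence = {"a0": 1, "a1": 1, "b0": 1, "b1": 1}
b_is_1 = {"a0": 1, "a1": 1, "b0": 0, "b1": 1}
print(evaluate(root, no_evidence), evaluate(root, b_is_1))
```

Once the circuit is built, answering a query only requires setting the indicators and making one pass over the nodes, which is what makes evaluation time predictable.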

By exploiting local structure, this modified VE algorithm can yield an arithmetic circuit whose size is much
smaller than exponential in treewidth. If one pays attention to how the CPTs of the Bayesian network
representing EPSs are auto-generated as described in Section 7, it is easy to see that many of these CPTs
will be small and deterministic. Arithmetic circuit compilation has been shown to perform well on many
such BNs, and as we shall see, the ADAPT BN is no exception [17, 44].

9 Experimental Results 

Probabilistic inference is the fourth and final stage in our probabilistic model-based diagnosis approach, and
the only one that needs to be performed on-line. We now discuss probabilistic inference experiments based
on an ADAPT BN with 503 discrete nodes and 579 edges. In the ADAPT BN, the number of states per node
ranges from 2 to 4 with an average of 2.23 and a median of 2. The experiments discussed here use two data
sets: real-world data from ADAPT and synthetic data that was automatically generated from the ADAPT BN.
The purpose of the experiments with real-world data was to characterize the diagnostic quality of the
ADAPT BN. The purpose of the experiments with synthetic data was to understand the performance of
arithmetic circuit evaluation versus alternative BN inference algorithms, variable elimination and join
tree propagation in particular. For arithmetic circuit evaluation, we used the ACE system to compile an
ADAPT BN into an arithmetic circuit and to evaluate that arithmetic circuit (see
http://reasoning.cs.ucla.edu/ace/ regarding ACE). The timing measurements reported here were
made on a PC with an Intel 4 1.83 GHz processor, 1 GB RAM, and Windows XP.

9.1 Experiments using Electrical Power System Data 

9.1.1 Design 

For experimentation using real-world data, EPS scenarios were generated using the ADAPT EPS at NASA
Ames. This combined hardware and software facility is unique in its capability to produce real-world EPS
data for testing diagnostic reasoners, while at the same time giving experimenters access to the gold
standard (or true underlying state) for experiments. These scenarios, which are summarized in Table 7, cover
component failures, sensor failures, and both component and sensor failures. In addition, each scenario
contains one, two, or three faults. Finally, and in order to stress-test our probabilistic reasoner, we did not
restrict inserted faults to discrete faults only. We also inserted continuous faults, specifically faults of the
form “stuck at x”, “noise StdDev = x”, or “drift slope = x”, for some numeric value x. Since our probabilistic
models do not contain continuous random variables, continuous faults cannot be diagnosed exactly,
but they are still of great interest and are included in many of the experiments reported on below.

In each scenario, ADAPT’s initial state was as follows: Circuit breakers were commanded closed; the
corresponding command variables in $E_C$ were clamped to cmdClose in evidence $e$. Relays were commanded
open; the corresponding relay variables in $E_C$ were clamped to cmdOpen in $e$. In this initial state, all health
nodes $H$ are deemed healthy when computing $\mathrm{MAP}(H)$, $\mathrm{MAP}_{\mathrm{mpe}}(H)$, or
$\mathrm{MAP}_{\mathrm{mlv}}(H)$. Continuous data, in particular continuous sensor readings in $E_S$, were
discretized before being used for clamping the appropriate discrete random variables in the ADAPT model.
To keep the experimental protocol consistent across scenarios, all inserted faults persisted until the end
of the experiments. The main diagnostic query $\mathrm{MAP}_{\mathrm{mpe}}(H)$, for which results are presented
in Table 7, was also taken towards the end of a scenario.

9.1.2 Results 

The results of the experiments with real-world data from ADAPT are summarized in Table 7. Each scenario is
presented in one or more rows of the table, along with faults inserted and the diagnostic results computed for
queries $\mathrm{MAP}_{\mathrm{mpe}}(H, e)$. Since $H$ contains 128 variables, reflecting the health status of
128 EPS components and sensors, we only show the variables found to be non-healthy in Table 7. Temporal
progressions of sensor readings, for a varying subset of sensors, for eight of the sixteen scenarios are
presented in Table 6.

9.1.3 Discussion 

Our main observations regarding these experiments are as follows. We see in Table 7 that the different 
diagnostic queries correctly diagnose a majority of these component and sensor failure scenarios. In fact, 
there is an exact match in 10 of the 16 scenarios. Even in cases where there is not exact agreement, the 
diagnosis is either partly matching or at least reasonable as we will see in the following. We now discuss in 
more detail results of experiments, in particular experiments in which an exact match was not obtained. 

[Table 6 plots are not reproduced here. Recovered panel titles: Experiment 311, Experiment 443,
Experiment 451, Experiment 309, Experiment 441, Experiment 452.]

Table 6: Results for eight fault-injection experiments using ADAPT. Time, measured in sample number, is
shown on the x-axis, while sampled sensor values for temperature, voltage, current, rotations per minute
(RPMs), etc. are shown on the y-axis.

ID  | Faults Inserted in ADAPT                   | Computed Diagnosis                   | Match
----|--------------------------------------------|--------------------------------------|-------
304 | Relay EY260 failed open                    | Health_relay_ey260_cl = stuckOpen    | Yes
305 | Relay feedback sensor ESH175 failed open   | Health_relay_ey175_cl = stuckOpen    | Yes
306 | Circuit breaker ISH262 tripped             | Health_breaker_ey262_op = stuckOpen  | Yes
308 | Voltage sensor E261 failed low             | Health_e261 = stuckVoltageLo         | Yes
309 | Battery BATT1 voltage low                  | Health_battery1 = stuckDisabled      | Yes
310 | Inverter INV1 failed off                   | Health_inv1 = stuckOpen              | Yes
311 | Light sensor LT500 failed low              | Health_lt500 = stuckLow              | Yes
441 | Relay EY160 stuck open                     | Health_relay_ey160_cl = stuckOpen    | Partly
    | Big fan ST515 stuck at 0 RPM               |                                      |
442 | Current sensor IT261 noise StdDev = 5      | Health_it261 = stuckCurrentHi        | Partly
    | Relay feedback sensor ESH172 stuck at 0    | Health_esh172 = stuckOpen            |
    | Current sensor IT140 stuck at 100          |                                      |
443 | Current sensor IT281 drift slope = 2       | Health_it281 = stuckCurrentHi        | Partly
    | Relay EY244 stuck closed                   | Health_relay_ey244_cl = stuckClosed  |
    | Big fan ST516 stuck at -10 RPM             |                                      |
445 | Voltage sensor E235 stuck at 0.3           | Health_e235 = stuckVoltageLo         | Partly
    | Relay feedback sensor ESH344A stuck closed | Health_relay_ey344_cl = stuckClosed  |
    | Inverter INV2 failed off                   | Health_inv2 = stuckOpen              |
447 | Voltage sensor E161 failed low             | Health_e161 = stuckVoltageLo         | Yes
    | Current sensor IT167 failed low            | Health_it167 = stuckCurrentLo        |
449 | Voltage sensor E140 failed low             | Health_e140 = stuckVoltageLo         | Yes
    | Voltage sensor E161 failed low             | Health_e161 = stuckVoltageLo         |
450 | Inverter INV1 failed off                   | Health_inv1 = stuckOpen              | Partly
    | Big fan ST515 stuck at 600 RPM             | Health_fan1_speed_st515 = stuckMid   |
451 | Relay EY171 failed open                    | Health_relay_ey171_cl = stuckOpen    | Yes
    | Light sensor LT500 failed low              | Health_lt500 = stuckLow              |
452 | Light bulb TE500 failed off                | Health_load170_bulb1 = stuckDisabled | Partly
    | Temperature sensor TE501 failed low        |                                      |

Table 7: Inserted faults versus diagnostic results, computed using the most probable diagnosis
$\mathrm{MAP}_{\mathrm{mpe}}(H, e)$, for different fault scenarios (with IDs 304, 305, ...) for the electrical
power system testbed ADAPT.

In Experiment 305, a single failure to open for ESH175’s feedback sensor was inserted, see Table 7. In
other words, the relay sensor (or the touch sensor) says that relay ESH175 is open while it is in fact closed.
Since this and the other relays upstream of LGT4 are closed, LGT4 is powered, as indicated by temperature
sensor TE511’s increasing temperature (starting at time 90) shown in Table 6. ACE computed a
correct diagnosis, namely that the ESH175 relay sensor failed or stuck open, see Table 7.

In Experiment 309, a single battery failure was inserted in BATT1. Specifically, the battery BATT1 
failed with low voltage around time 150. The effect of this failure is a general drop in light, temperature, 
and RPMs as reflected in Table 6. This fault is correctly diagnosed by ACE (see Table 7). 

In Experiment 311, LT500’s light sensor was failed low around time 120. This is reflected in Table 6 as 
follows: The AC voltage from the inverter goes high around time 55, and the light sensor and the temperature 
sensors indicate that the light bulbs are on starting at time 80. A little after time 120, LT500 goes to zero, 
however. Since the temperatures appear to continue to rise (see TE500, TE501, and TE502 in Table 6), a 
light sensor fault as computed by ACE is very reasonable and in fact the correct diagnosis. 

In Experiment 441 in Table 7, both EY160 and ST515 were failed. However, one of these faults (namely 
relay EY160 stuck open) is sufficient to explain all observations. The reason for this is that EY160 controls 
power to all loads on load bank 1, including the fan sensed by ST515. Since EY160 is upstream of and 
controls the power to ST515, see Figure 3, the stuck at 0 fault of ST515 is consistent with the single-fault 
EY160 stuck open diagnosis computed by ACE, and in fact has a greater probability than the double faults 
actually inserted. In other words, the ST515 failure is masked by the EY160 failure. 

Experiments 442 and 443 have continuous faults inserted (change in noise StdDev and drift respectively)
that are currently beyond the scope of our discrete probabilistic model. In Experiment 442, there are no
sensors on the affected load, making it difficult to detect whether (i) the relay has failed open, thus turning
off the load, or (ii) the relay feedback sensor has failed open. (The BN currently does not model the exact
current draw in the system.) If the relay failed open and turned off the attached load, there would be
a drop in current being drawn from the battery because there are fewer loads. But we are not discretizing
the current nodes in this way, which sometimes makes it difficult to distinguish between relay failure and
relay position sensor failure.

In Experiment 443, IT281 starts drifting at time 100 and ST516 gets stuck at -10 RPM at time 130.
Both of these failures are easy to see in Table 6. Here, 2 of the 3 faulty components were correctly isolated in
spite of continuous faults being inserted. We now consider the one fault not caught: This fault was inserted
in ST516, a fan on Load bank 2, which was not commanded on during this experiment. In other words, the
fact that ST516 was neither on nor commanded to be on made the abnormally low sensor reading of -10
RPM harder to detect. Another issue was the discretization in the BN, where the faulty sensor reading of
-10 is binned with the correct sensor reading of 0. Thus, even though the diagnosis misses ST516’s failure,
a more fine-grained discretization could have caught this.
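
The binning effect described here can be reproduced with a small threshold-based discretizer. In the Python sketch below, all thresholds and state labels are hypothetical values chosen for illustration; they are not the discretization used in the ADAPT BN.

```python
from bisect import bisect_right


def discretize(value, thresholds, labels):
    """Map a continuous reading to the label of the bin it falls into.

    thresholds must be sorted; len(labels) == len(thresholds) + 1.
    """
    return labels[bisect_right(thresholds, value)]


# HYPOTHETICAL three-state discretization of a fan speed sensor (RPM).
COARSE = ([300.0, 900.0], ["readLo", "readMid", "readHi"])
# HYPOTHETICAL finer discretization with a separate bin for negative values.
FINE = ([-5.0, 300.0, 900.0], ["readNegative", "readLo", "readMid", "readHi"])

for rpm in (-10.0, 0.0, 600.0):
    print(rpm, discretize(rpm, *COARSE), discretize(rpm, *FINE))
```

With the coarse bins, a reading of -10 RPM is indistinguishable from a fan that is simply off, whereas the finer bins separate the two cases at the cost of a larger state space.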

In Experiment 445, 2 of the 3 faulty components were correctly isolated, and the only difficulty was due 
to the continuous fault inserted, namely E235 stuck at 0.3, discretized in the AC to stuck low. 

In Experiment 450, two faults were inserted: inverter INV1 failed and fan ST515 got stuck at 600 RPM.
When the inverter failed, all downstream power was disrupted. So E165, E167, ST165, LT500, IT167 go
to low values. The inverter failure around time 100 is perhaps easiest to see in Table 6 if we compare the
voltage downstream of the inverter (EI165 and EI167 both show minimal voltage after around time 100)
with the voltage upstream of the inverter (EI161 shows nominal voltage). On the load side, we note that
TE500 and TE501 are increasing until around time 100, at which time they start decreasing. This behavior
is also consistent with INV1 failure at time 100; the diagnosis Health_inv1 = stuckOpen is correct. The
failure of ST515, which is powered through INV1, is clearly visible in Table 6. ST515, which should also
have gone to a low value (because the fan does not have power anymore and therefore is not spinning), was
stuck reading that its nominal value was around 600, which led to the partly correct diagnosis stuckMid.
The diagnosis is here essentially correct; however, due to the discretization the diagnosis is approximate. A
more fine-grained discretization could have improved the approximation.

In Experiment 451, two faults were inserted: EY171 failed open and LT500 failed low. The LT500 fault 
manifested itself in the drop of the measured lighting level for LT500 around time 110. However, the 
temperatures (as measured by TE500, TE501, and TE502) keep rising, suggesting power is still flowing. 
The EY171 fault manifested itself in ST515’s dropping RPM around time 100; clearly, EY171 failed open is 
an explanation of this, since EY171 controls ST515. ACE computed the correct diagnosis of LT500 sensor 
failure and EY171 failed open. 

In Experiment 452, two faults were inserted, for Bulb 0 (with sensor TE500) and TE501 (for Bulb 1). 
Bulb 0 was failed off, while TE501 was failed low. Towards the end of this experiment, light sensor LT500 
falls from ≈ 43 to ≈ 32. This lower value of ≈ 32 strongly suggests that only two bulbs are on, in other 
words that one bulb, out of the three bulbs present, had failed. Temperature sensor TE500 started falling, 
indicating that the bulb associated with that sensor was off. Then TE501 went to 0 while the light sensor 
reading remained the same, indicating that temperature sensor TE501 was likely also faulty. At the time 
of the diagnosis, we have the following evidence: TE500 reads high (its derivative is negative, 
indicating that the bulb is off, but the derivative is not part of the discrete model); TE501 reads low; TE502 reads high; 
and LT500’s sensor reading indicates that two bulbs are lit. Thus, based on the evidence provided, ACE 
finds a single fault of stuckDisabled for Bulb 1. This diagnosis of Health_load170_bulb1 = stuckDisabled 
is a direct result of the TE501 and LT500 readings. However, two faults were actually inserted, for Bulb 
0 and TE501. This highlights two issues. First, the discretization does not perfectly capture the signature 
of a bulb being off. Specifically, the bulb is still warm from having previously been on, leading the TE500 
value to be above the threshold defined for on. A second issue is that temporal aspects are not captured 
by taking one time slice near the end of the run; in this case there are temporal clues that point toward the correct 
diagnosis. 
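
The difference between a single time slice and a trend-aware reading of the same sensor can be sketched as follows. The temperature samples, the on-threshold, and the slope threshold are hypothetical values chosen for illustration; the trend rule is not part of the discrete model described in this work.

```python
def bulb_state_from_slice(temp, on_threshold):
    """Single-slice rule: the bulb is judged on if the temperature exceeds the threshold."""
    return "on" if temp > on_threshold else "off"

def bulb_state_with_trend(temps, on_threshold, slope_threshold):
    """Trend-aware rule: a clearly falling temperature indicates that the bulb is off,
    even when the most recent value is still above the on-threshold."""
    slope = (temps[-1] - temps[0]) / (len(temps) - 1)  # average change per sample
    if slope < slope_threshold:
        return "off"
    return bulb_state_from_slice(temps[-1], on_threshold)

# Hypothetical readings of a bulb that was switched off but is still warm.
te500 = [120.0, 112.0, 105.0, 99.0]
ON_THRESHOLD = 60.0      # hypothetical threshold for "on"
SLOPE_THRESHOLD = -2.0   # hypothetical: a steeper drop than this means "cooling down"

slice_state = bulb_state_from_slice(te500[-1], ON_THRESHOLD)
trend_state = bulb_state_with_trend(te500, ON_THRESHOLD, SLOPE_THRESHOLD)
print(slice_state, trend_state)
```

The single-slice rule reports the bulb as on because the last sample is still above the threshold, whereas the trend-aware rule picks up the cooling and reports it as off.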

We note that there are several different but related phenomena underlying the mismatches in Table 7. 
Due to our knowledge of the faults inserted into ADAPT, we are in a position to discuss these different 
phenomena in more detail. First, and reflecting the challenging nature of the fault scenarios that can be 
created using ADAPT, continuous faults (as inserted in experiments 441, 442, 443, 445, 448, and 450) are 
simply beyond the scope of our currently discrete probabilistic model. A second phenomenon is the following. 
A probabilistic model allows one to distinguish between what is possible versus impossible, and among what 
is possible, compute the probabilities for different possible explanations. However, there is no guarantee 
that the inserted faults are part of a unique MPE for the given evidence for E. For example, three faults 
may have been inserted into ADAPT, but there may be an explanation with one or two faults that has the 
same or higher joint probability, given the evidence for those faults. Experiment 441 is a good example of 
this effect. Third, there are faults that could have been detected had more fine-grained discretizations of 
random variables been used. Experiment 452 provides an example of this, since the drop in the reading 
of the temperature sensor TE501 was quite dramatic and indicative of a sensor failure rather than only a 
failure in the light bulb TE500. A fourth phenomenon is that there might be too few, improperly placed, 
or inadequate sensors to distinguish between different faults. Many of the mismatches in these experiments 
could have been avoided had more appropriate sensing been used; a detailed discussion of sensor placement 
is beyond the scope of this work, however. 

Inference              MPE                  Marginals
Time (ms)         VE        ACE         JTP        ACE
Minimum         19.30     0.2235       9.792     0.5721
Maximum         40.21     2.5411       65.34     5.9228
Median          19.81     0.2260       10.52     0.6006
Mean            20.13     0.2625       11.01     0.7854
St. Dev.        1.554     0.2028       4.101     0.6970

Table 8: Results for different inference algorithms (VE, ACE, and JTP) when computing MPEs and 
marginals using synthetic data generated from the ADAPT BN. 
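
The second phenomenon discussed above, in which an explanation with fewer faults has a higher joint probability than the faults actually inserted, can be illustrated with a small calculation. The prior fault probability, the component names, and the likelihood values are hypothetical, and components are assumed to fail independently; the ADAPT BN is considerably richer than this sketch.

```python
from math import prod

P_FAULT = 0.01  # hypothetical prior probability that a single component is faulty

def joint_probability(faulty, components, likelihood):
    """Joint probability of an explanation and the evidence, assuming independent
    component faults and a given probability of the evidence under that explanation."""
    prior = prod(P_FAULT if c in faulty else 1.0 - P_FAULT for c in components)
    return prior * likelihood

COMPONENTS = ["A", "B", "C"]  # hypothetical component names

# Assume the observed evidence is equally well explained by both candidates.
three_faults = joint_probability({"A", "B", "C"}, COMPONENTS, likelihood=0.9)
one_fault = joint_probability({"A"}, COMPONENTS, likelihood=0.9)

print(one_fault > three_faults)
```

When the evidence does not discriminate between the two candidates, the prior dominates and the single-fault explanation is preferred, even if three faults were in fact inserted.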

In summary, we have observed strong performance for our probabilistic model in these controlled 
experiments with ADAPT. We also note that a richer way of presenting diagnostic results would be helpful 
but non-trivial to provide. Specifically, it would be useful to have access to all explanations with non-zero 
probability, together with their probabilities, rather than only the most probable explanation. These 
experimental results also motivate several future research directions as discussed in Section 10. 

9.2 Experiments using Simulated Data 

In order to understand how arithmetic circuit evaluation performs in comparison to other BN inference 
algorithms in the ADAPT setting, a large number of scenarios were automatically generated and used in 
experiments as discussed in the following. 

9.2.1 Design 

In order to better understand the performance of arithmetic circuit evaluation (ACE), we performed comparative 
experiments with variable elimination (VE) and join tree propagation (JTP). Simulated data was 
created by a program that (i) generated a set of failure scenarios according to the probabilities of the ADAPT 
BN’s health nodes H, and (ii) generated evidence by doing stochastic simulation for each failure scenario. 
These evidence sets were then used as evidence in the three different inference systems, and inference was 
performed as presented below. 
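
The two-step data generation described above can be sketched on a toy model. The node names, states, and probabilities below are hypothetical and much simpler than the ADAPT BN; the sketch only shows the structure of the procedure, namely sampling a failure scenario from the health-node priors and then stochastically simulating evidence for that scenario.

```python
import random

# Hypothetical priors for two health nodes (not the ADAPT values).
HEALTH_PRIORS = {
    "Health_relay": {"healthy": 0.98, "stuckOpen": 0.02},
    "Health_sensor": {"healthy": 0.97, "stuckLow": 0.03},
}

def sample_state(dist, rng):
    """Draw one state from a discrete distribution given as {state: probability}."""
    states = list(dist)
    weights = [dist[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

def sample_scenario(rng):
    """Step (i): draw one state per health node according to its prior."""
    return {node: sample_state(dist, rng) for node, dist in HEALTH_PRIORS.items()}

def sample_evidence(scenario, rng):
    """Step (ii): stochastically simulate a sensor reading for the given scenario."""
    if scenario["Health_sensor"] == "stuckLow":
        reading_dist = {"low": 1.0}                 # a stuck sensor always reads low
    elif scenario["Health_relay"] == "healthy":
        reading_dist = {"high": 0.95, "low": 0.05}  # power flows, small noise
    else:
        reading_dist = {"low": 0.95, "high": 0.05}  # relay open, small noise
    return {"Sensor_voltage": sample_state(reading_dist, rng)}

rng = random.Random(0)  # fixed seed so that runs are repeatable
evidence_sets = []
for _ in range(200):
    scenario = sample_scenario(rng)
    evidence_sets.append((scenario, sample_evidence(scenario, rng)))

print(len(evidence_sets))
```

Each generated evidence set could then be passed to every inference system under comparison, so that all systems are timed on identical inputs.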

9.2.2 Results 

Results from the experiments are summarized in Table 8. Both MPEs and marginals were computed for 200 
simulated evidence sets generated from the ADAPT BN. 

9.2.3 Discussion 

The main points, which are in line with previous results on a smaller version of the ADAPT BN [45], are 
as follows. On average, ACE is over 76 times faster than VE when computing MPEs (see Table 8). In 
addition, ACE can compute all marginals, supporting the probabilistic queries $\mathrm{BEL}(H, e)$ (where $H \in \mathbf{H}$) 
and $\mathrm{MAP}_{\mathrm{mlv}}(H, e)$, using roughly three times the mean time needed for computing MPEs, or $\mathrm{MAP}_{\mathrm{mpe}}(H, e)$. 
In other words, ACE computes probabilities over 500 random variables roughly 25 times faster on average 
(about 33 times when medians are compared) than 
VE computes probabilities for a single random variable. The third inference system, JTP, can compute all 
marginals in a manner similar to ACE. This overcomes VE’s limitation of computing probabilities for only 
one random variable at a time. Compared to ACE, however, JTP is over 14 times slower and also has a 
standard deviation that is more than 5 times greater. 
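
The ACE-versus-VE ratio for MPEs and the ACE-versus-JTP ratios for marginals quoted above can be recomputed from the mean and standard deviation rows of Table 8. The short calculation below does this; the dictionary keys are names chosen here for illustration.

```python
# Mean and standard deviation rows of Table 8, in milliseconds.
mean_ms = {"VE_mpe": 20.13, "ACE_mpe": 0.2625, "JTP_marg": 11.01, "ACE_marg": 0.7854}
std_ms = {"JTP_marg": 4.101, "ACE_marg": 0.6970}

mpe_speedup = mean_ms["VE_mpe"] / mean_ms["ACE_mpe"]       # ACE vs. VE, MPE computation
marg_speedup = mean_ms["JTP_marg"] / mean_ms["ACE_marg"]   # ACE vs. JTP, marginals
std_ratio = std_ms["JTP_marg"] / std_ms["ACE_marg"]        # spread of JTP vs. ACE

print(f"MPE speedup (VE/ACE):        {mpe_speedup:.1f}")
print(f"Marginals speedup (JTP/ACE): {marg_speedup:.1f}")
print(f"St. dev. ratio (JTP/ACE):    {std_ratio:.1f}")
```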

The auto-generation approach produces, for ADAPT, a BN that all three systems perform well on. This 
illustrates that the ADAPT BN was carefully generated, using our novel modelling approach and auto-generation 
algorithm, in a manner that supports efficient inference using three quite different exact inference 
algorithms. Our next two points consider the performance of ACE versus VE and JTP. First, the ACE 
system outperforms VE for MPE computation and JTP for computation of marginals; in fact ACE is one 
or two orders of magnitude more efficient than these two other algorithms. Second, the standard deviation 
is substantially smaller for ACE than for VE and JTP. The fast and predictable inference times of ACE 
are both very important factors for electrical power system health management in the real-time setting of 
aerospace. 

10 Conclusion and Future Work 

In this work, we have discussed an electrical power system application of the probabilistic approach to 
model-based diagnosis. Specifically, we have discussed the use of Bayesian networks and arithmetic circuits 
to perform diagnosis and health management in electrical power systems in aircraft and spacecraft. We 
have emphasized two important issues that arise in engineering diagnostic applications in this area, namely 
the challenges of modelling and real-time reasoning. The modelling challenge concerns how to model a 
real-world EPS by means of Bayesian networks. To address this challenge, we developed a systematic way 
of representing electrical power systems as Bayesian networks, supported by an easy-to-use specification 
language and an auto-generation algorithm. The second challenge, that of real-time reasoning, is associated 
with the embedding of algorithms that solve computationally hard problems, including diagnostic reasoning, 
into hard real-time systems [14, 15]. To address this challenge, we compile Bayesian networks into arithmetic 
circuits, an approach that supports real-time diagnosis in two ways. First, the use of arithmetic circuits 
results in more predictable diagnostic inference times. Second, it results in much faster inference. 

While compilation of Bayesian networks to arithmetic circuits is well-established [37, 39, 38, 40, 17], this 
work further extends the reach of the technology by introducing a high-level EPS specification language 
from which Bayesian networks are auto-generated, and showing that the combined approach gives strong 
results on a real-world EPS. 

Future directions of work include the following. First, improved modeling of and reasoning with continuous 
behavior, using soft evidence, highly discretized, and/or continuous random variables, along with 
representation using arithmetic circuits for purposes of compilation, would be of great interest. A second 
area of interest is improved modeling of dynamic, transient, and cascading faults along with their integration 
into the compilation approach. Third, it would be very useful to extend the high-level specification language 
to handle (i) novel components and states; (ii) continuous behavior; and (iii) dynamics such as transient 
and cascading faults. Finally, it would be interesting to further investigate sensing issues, including the 
questions of optimal sensor placement as well as the number and types of sensors needed to distinguish 
between different faults. 